Corpus Analysis based on Structural Phenomena in Texts: Exploiting TEI Encoding for Linguistic Research
Author
Abstract
This paper asks how corpus-based linguistic research may be enriched by exploiting conceptual text structures and layout as provided via TEI annotation. Examples of possible research areas and usage scenarios are drawn from the German historical corpus of the Deutsches Textarchiv (DTA) project, which has been consistently tagged in accordance with the TEI Guidelines, more specifically with the DTA ›Base Format‹ (DTABf). The paper shows that, when TEI-XML structuring is included in corpus-based analyses, significant observations can be made about different linguistic phenomena, e.g. the development of conceptual text structures themselves, the syntactic embedding of terms in certain conceptual text structures, and phenomena of language change that become apparent via the layout of a text. The exemplary study carried out here shows some of the potential of exploiting TEI annotation for linguistic research, which might be kept in mind when making design decisions for new corpora.
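As a rough illustration of what including TEI-XML structuring in a corpus query can look like, the sketch below counts word forms separately per structural element. It is not the method used in the paper: the sample document, the function name `token_counts_by_element`, the chosen element set (`head`, `p`, `note`) and the naive whitespace tokenization are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The TEI P5 namespace, in ElementTree's "{uri}tag" notation.
TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical miniature TEI document, invented for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div>
      <head>Von der Luft</head>
      <p>Die Luft ist ein Element.</p>
      <note place="foot">Siehe die Luft oben.</note>
    </div>
  </body></text>
</TEI>"""


def token_counts_by_element(xml_text, tags=("head", "p", "note")):
    """Count lower-cased tokens per TEI element type.

    Tokenization is naive (whitespace split, punctuation stripped).
    Text in nested elements is counted for every enclosing element
    that appears in `tags`.
    """
    root = ET.fromstring(xml_text)
    counts = {tag: Counter() for tag in tags}
    for tag in tags:
        for elem in root.iter(TEI_NS + tag):
            text = "".join(elem.itertext())
            tokens = (tok.strip(".,;:!?").lower() for tok in text.split())
            counts[tag].update(tok for tok in tokens if tok)
    return counts


counts = token_counts_by_element(SAMPLE)
for tag, counter in counts.items():
    print(tag, dict(counter))
```

Keeping the counts separate per element type is what allows a comparison of, say, vocabulary in headings against vocabulary in running text; a real study would replace the whitespace split with the corpus's own tokenization and linguistic annotation.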
Related resources
Encoding Biomedical Resources in TEI: The Case of the GENIA Corpus
It is well known that standardising the annotation of language resources significantly raises their potential, as it enables re-use and spurs the development of common technologies. Despite the fact that increasingly complex linguistic information is being added to biomedical texts, no standard solutions have so far been proposed for their encoding. This paper describes a standardised XML tagse...
Computer-Assisted Processing of Intertextuality in Ancient Languages
The production of digital critical editions of texts using TEI is now a widely-adopted procedure within digital humanities. The work described in this paper extends this approach to the publication of gnomologia (anthologies of wise sayings), which formed a widespread literary genre in many cultures of the medieval Mediterranean. These texts are challenging because they were rarely copied strai...
Corpus Encoding Standard: SGML Guidelines for Encoding Linguistic Corpora
The Corpus Encoding Standard (CES) is an application of SGML (ISO 8879:1986, Information Processing--Text and Office Systems--Standard Generalized Markup Language), conformant to the TEI Guidelines for Electronic Text Encoding and Interchange (Sperberg-McQueen and Burnard, 1994). It provides encoding conventions for linguistic corpora designed to be optimally suited for use in language engineer...
Automatic linguistic annotation of historical language: ToTrTaLe and XIX century Slovene
The paper describes a tool developed to process historical (Slovene) text, which annotates words in a TEI encoded corpus with their modern-day equivalents, morphosyntactic tags and lemmas. Such a tool is useful for developing historical corpora of highly-inflecting languages, enabling full text search in digital libraries of historical texts, for modernising such texts for today's readers and m...
The DTA 'base format': A TEI-subset for the compilation of interoperable corpora
This article describes a strict subset of TEI P5, the DTA ‘base format’, which combines the richness of encoding noncontroversial structural aspects of texts while allowing only minimal semantic interpretation. The proposed format is discussed with regard to other commonly used XML/TEI schemas. Furthermore, the article presents examples of good practices showing how external corpora can either ...